September 30, 2020
We have seen the indispensable commands of R and Python for data science. Of course, what we have seen is just a wave, and the ocean is yet to be crossed. But let us cross it wave by wave, and that will surely get us across the ocean.
In this post we are going to look at data preprocessing techniques. Data preprocessing is simply preparing the data in a proper format for further analysis, for example by handling missing values. Below are some of the techniques.
Missing Value Analysis
Outlier Analysis
Feature Selection
Feature Scaling
Sampling Techniques
Let's say a few words about each topic here; every topic will be elucidated in a separate post.
The dataset that we have collected for analysis may or may not have missing values, which means a column in the dataset may or may not have all of its values. As a rule of thumb, if a particular variable/column has more than 35% of its values missing, we drop that column altogether. If less than 35% are missing, we proceed with filling in the missing values.
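As a rough illustration of the 35% rule, here is a minimal sketch in Python with pandas. The toy data and the column names (age, income, city) are made up for this example and are not from any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are invented for illustration.
df = pd.DataFrame({
    "age":    [25, np.nan, 31, 47, 52, 38, np.nan, 29, 44, 36],
    "income": [np.nan, np.nan, 52000, np.nan, np.nan, 61000, 58000, np.nan, 49000, 75000],
    "city":   ["A", "B", "B", "A", None, "C", "A", "B", "C", "A"],
})

THRESHOLD = 35.0  # percent, the cut-off suggested in the text

# Percentage of missing values in each column.
missing_pct = df.isna().mean() * 100

# Columns above the threshold are dropped; the rest are kept for imputation.
to_drop = missing_pct[missing_pct > THRESHOLD].index.tolist()
df_kept = df.drop(columns=to_drop)

print(missing_pct.round(1))
print("dropped:", to_drop)
```

Here income has half of its values missing, so it is dropped, while age and city stay in the dataset and wait for imputation.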
How Do We Get Missing Values?
If a column has more than 35% missing values
If a column has less than 35% missing values
Estimate the missing values with each of the available imputation methods, see which estimate is nearest to a known value, and follow that method. The steps are:
Take the column on which missing value analysis is to be done.
Remove one value that already exists; it can be any value from that column.
Now, using each of the methods (central statistics, KNN imputation, prediction method (ML)), calculate that value.
We get one value from each method; compare each of them with the removed value.
Whichever method gives the nearest matching value is the one to use for that column.
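The steps above can be sketched in Python as follows. This is only an illustration under some assumptions: the data is made up, the KNN estimate is a tiny hand-written version that uses a single helper column (age), and a straight-line fit with np.polyfit stands in for the prediction method (ML). Real projects would typically use library implementations instead.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "income" is the column under study and
# "age" is a helper feature for the KNN and regression estimates.
df = pd.DataFrame({
    "age":    [22, 25, 30, 35, 40, 45, 50, 55],
    "income": [20.0, 24.0, 31.0, 36.0, 41.0, 44.0, 52.0, 90.0],
})

col, row = "income", 3           # pick the column and one known cell
actual = df.loc[row, col]        # remember the true value
work = df.copy()
work.loc[row, col] = np.nan      # hide it, as if it were missing

known = work.dropna(subset=[col])
age_of_hidden = work.loc[row, "age"]

def knn_estimate(k=2):
    # Average the target over the k rows whose age is closest.
    dist = (known["age"] - age_of_hidden).abs()
    return known.loc[dist.nsmallest(k).index, col].mean()

def regression_estimate():
    # Straight-line fit of income on age, used as a simple ML stand-in.
    slope, intercept = np.polyfit(known["age"].to_numpy(), known[col].to_numpy(), 1)
    return slope * age_of_hidden + intercept

# One estimate per candidate method.
estimates = {
    "mean": known[col].mean(),
    "median": known[col].median(),
    "knn": knn_estimate(),
    "regression": regression_estimate(),
}

# Compare each estimate with the removed value and keep the closest method.
errors = {name: abs(value - actual) for name, value in estimates.items()}
best = min(errors, key=errors.get)
print(estimates)
print("closest method:", best)
```

A single removed value can favour a method by chance, so it may be worth repeating the check with a few different cells before settling on one method.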
Now let us see missing value analysis using R and Python. Exit this page and move up for those topics!